Clustering by Authorship Within and Across Documents

نویسندگان

Efstathios Stamatatos

Michael Tschuggnall

Ben Verhoeven

Walter Daelemans

Günther Specht

Benno Stein

Martin Potthast

چکیده

The vast majority of previous studies in authorship attribution assume the existence of documents (or parts of documents) labeled by authorship to be used as training instances in either closed-set or open-set attribution. However, in several applications it is not easy or even possible to find such labeled data and it is necessary to build unsupervised attribution models that are able to estimate similarities/differences in personal style of authors. The shared tasks on author clustering and author diarization at PAN 2016 focus on such unsupervised authorship attribution problems. The former deals with single-author documents and aims at grouping documents by authorship and establishing authorship links between documents. The latter considers multi-author documents and attempts to segment a document into authorial components, a task strongly associated with intrinsic plagiarism detection. This paper presents an overview of the two tasks including evaluation datasets, measures, results, as well as a survey of a total of 10 submissions (8 for author clustering and 2 for author diarization).

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Automated unsupervised authorship analysis using evidence accumulation clustering

Authorship Analysis aims to extract information about the authorship of documents from features within those documents. Typically, this is performed as a classification task with the aim of identifying the author of a document, given a set of documents of known authorship. Alternatively, unsupervised methods have been developed primarily as visualisation tools to assist the manual discovery of ...

متن کامل

Multi Feature Space Combination for Authorship Clustering

The Author Identification task for PAN 2016 consisted of three different Sub-tasks: authorship clustering, authorship links and author diarization. We developed a machine learning approaches for two of three of these tasks. For the two authorship related tasks we created various sets of feature spaces. The challenge was to combine these feature spaces to enable the machine learning algorithms t...

متن کامل

Efficient Unsupervised Authorship Clustering Using Impostor Similarity

Some real-world authorship analysis applications require techniques that scale to thousands of documents with little or no a priori information about the number of candidate authors. While there is extensive research on identifying authors given a small set of candidates and ample training data, almost none is based on real-world applications of clustering documents by authorship, independent o...

متن کامل

Authorship Clustering using Multi-headed Recurrent Neural Networks

A recurrent neural network that has been trained to separately model the language of several documents by unknown authors is used to measure similarity between the documents. It is able to find clues of common authorship even when the documents are very short and about disparate topics. While it is easy to make statistically significant predictions regarding authorship, it is difficult to group...

متن کامل

Author Clustering based on Compression-based Dissimilarity Scores

The PAN 2017 Author Clustering task examines the two application scenarios complete author clustering and authorship-link ranking. In the first scenario, one must identify the number (k) of different authors within a document collection and assign each document to exactly one of the k clusters, where each cluster corresponds to a different author. In the second scenario, one must establish auth...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2016

Clustering by Authorship Within and Across Documents

نویسندگان

چکیده

منابع مشابه

Automated unsupervised authorship analysis using evidence accumulation clustering

Multi Feature Space Combination for Authorship Clustering

Efficient Unsupervised Authorship Clustering Using Impostor Similarity

Authorship Clustering using Multi-headed Recurrent Neural Networks

Author Clustering based on Compression-based Dissimilarity Scores

عنوان ژورنال:

اشتراک گذاری